Note: the data can be found, for instance, at https://github.com/udacity/self-driving-car/tree/master/datasets, published under the MIT License.
The file is not distributed with the Docker image, but you can download it and put it into HDFS.
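A minimal sketch of that step, in the same style as the %%bash cells below. The download source and the HDFS destination are assumptions inferred from the paths used later in this notebook; the actual commands are shown as comments since they depend on your environment.

```shell
# Hedged sketch: paths below mirror the ones used later in this notebook.
BAG=/root/project/doc/el_camino_north.bag
# 1) Download the bag from the dataset release (exact file URL not shown here):
#    wget -O "$BAG" <dataset file URL>
# 2) Copy it into HDFS so the input format can read it:
#    hdfs dfs -put "$BAG" /user/root/
echo "HDFS target: hdfs://127.0.0.1:9000/user/root/$(basename "$BAG")"
```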
In [1]:
%%bash
ls -tralFh /root/project/doc/el_camino_north.bag
In [2]:
%%bash
# same size, no worries - only the -h (human-readable) formatting differs in rounding
hdfs dfs -ls -h
Solved the issue https://github.com/valtech/ros_hadoop/issues/6
The issue was due to ByteBuffer being limited to the JVM Integer size; it has nothing to do with Spark or with how RosbagMapInputFormat works within Spark. It was only a problem when extracting the conf index with the jar.
Integer.MAX_VALUE is about 2 GB!
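A quick check of that limit with plain Python arithmetic (nothing project-specific): a JVM ByteBuffer is indexed with a Java int, so its capacity tops out at Integer.MAX_VALUE bytes, i.e. just under 2 GiB.

```python
# Java int is a signed 32-bit integer, so Integer.MAX_VALUE = 2**31 - 1.
INTEGER_MAX_VALUE = 2**31 - 1
print(INTEGER_MAX_VALUE)          # 2147483647
print(INTEGER_MAX_VALUE / 2**30)  # just under 2.0 GiB - why a larger bag overflows one ByteBuffer
```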
In [3]:
%%time
out = !java -jar ../lib/rosbaginputformat.jar -f /root/project/doc/el_camino_north.bag
In [4]:
%%bash
ls -tralFh /root/project/doc/el_camino_north.bag*
In [5]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sparkConf = SparkConf()
sparkConf.setMaster("local[*]")
sparkConf.setAppName("ros_hadoop")
sparkConf.set("spark.jars", "../lib/protobuf-java-3.3.0.jar,../lib/rosbaginputformat.jar,../lib/scala-library-2.11.8.jar")
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext
In [6]:
fin = sc.newAPIHadoopFile(
path = "hdfs://127.0.0.1:9000/user/root/el_camino_north.bag",
inputFormatClass = "de.valtech.foss.RosbagMapInputFormat",
keyClass = "org.apache.hadoop.io.LongWritable",
valueClass = "org.apache.hadoop.io.MapWritable",
conf = {"RosbagInputFormat.chunkIdx":"/root/project/doc/el_camino_north.bag.idx.bin"})
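Each record emitted by RosbagMapInputFormat is a (key, value) pair, with a LongWritable key and a MapWritable value (surfaced in Python as an int and a dict). As a minimal sketch, assuming the `fin` RDD created above, a hypothetical helper to peek at the first few keys could look like this:

```python
def first_keys(rdd, n=5):
    """Hypothetical helper: take the first n (key, value) pairs from the
    RDD returned by sc.newAPIHadoopFile and keep only the keys."""
    return [key for key, _value in rdd.take(n)]

# Usage (assumes the `fin` RDD from the cell above):
# first_keys(fin)
```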
In [14]:
fin
Out[14]: